A Visual Exploration and Statistical Analysis of a Diabetes Dataset using Python

By: Noureddin Sadawi, PhD

This Dataset is Freely Available

Overview:

The data was collected and made available by the "National Institute of Diabetes and Digestive and Kidney Diseases" as part of the Pima Indians Diabetes Database.

Diabetes.csv is available from Kaggle. We have several questions: which information is most correlated with a positive diagnosis? And if we could ask a patient only two questions, what should they be, and how would we estimate their risk of being diagnosed?

This is a machine learning database, and normally we'd just extract features, feed them to an ML algorithm, and sit back and relax. But here we'll get our hands dirty so that you can learn more.

We'll be using Python and some of its popular data science related packages. First of all, we will import pandas to read our data from a CSV file and manipulate it for further use.

We will also use numpy to convert our data into a format suitable to feed our classification model.

We'll use seaborn and matplotlib for visualizations. We will then import the Logistic Regression algorithm from sklearn. This algorithm will help us build a classification model.

++++++++++++++++++++++++++++++++++++

The following features have been provided to help us predict whether a person is diabetic or not:

  • Pregnancies: Number of times pregnant
  • Glucose: Plasma glucose concentration over 2 hours in an oral glucose tolerance test
  • BloodPressure: Diastolic blood pressure (mm Hg)
  • SkinThickness: Triceps skin fold thickness (mm)
  • Insulin: 2-Hour serum insulin (mu U/ml)
  • BMI: Body mass index (weight in kg / (height in m)²)
  • DiabetesPedigreeFunction: Diabetes pedigree function (a function which scores likelihood of diabetes based on family history)
  • Age: Age (years)
  • Outcome: Class variable (0 if non-diabetic, 1 if diabetic)

Scenario: Imagine you have collected this data and wish to analyse it

Import required libraries
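A typical first cell might look like the following (a sketch of the imports described above; seaborn is only needed for the later plots):

```python
# Core data-handling libraries
import pandas as pd
import numpy as np

# Visualisation
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; safe when running as a script
import matplotlib.pyplot as plt
import seaborn as sns

# Classification model
from sklearn.linear_model import LogisticRegression
```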

Load the data and view first few rows
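In the notebook this is a one-liner, `df = pd.read_csv("diabetes.csv")`. To keep this sketch self-contained, here is the same idea using an in-memory copy of the five rows shown below:

```python
import io
import pandas as pd

# In practice: df = pd.read_csv("diabetes.csv")
# Here we use an in-memory stand-in holding the first five rows of the dataset.
csv_text = """Pregnancies,Glucose,BloodPressure,SkinThickness,Insulin,BMI,DiabetesPedigreeFunction,Age,Outcome
6,148.0,72,35,,33.6,0.627,50.0,1
1,85.0,66,29,,26.6,0.351,31.0,0
8,183.0,64,0,,23.3,0.672,32.0,1
1,89.0,66,23,94.0,28.1,0.167,21.0,0
0,137.0,40,35,168.0,43.1,2.288,33.0,1
"""
df = pd.read_csv(io.StringIO(csv_text))
print(df.head())  # first few rows, as shown below
```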

Pregnancies Glucose BloodPressure SkinThickness Insulin BMI DiabetesPedigreeFunction Age Outcome
0 6 148.0 72 35 NaN 33.6 0.627 50.0 1
1 1 85.0 66 29 NaN 26.6 0.351 31.0 0
2 8 183.0 64 0 NaN 23.3 0.672 32.0 1
3 1 89.0 66 23 94.0 28.1 0.167 21.0 0
4 0 137.0 40 35 168.0 43.1 2.288 33.0 1

1- Basic Exploration and Cleaning up

Missing and Zero Values

  • It is clear the data has missing and zero values
  • For example, we can see that SkinThickness = 0 in the third row
  • And we can also see some 'NaN' values
  • Sometimes missing values are represented by a '?'
  • Let's get more info
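The calls behind this step are `df.info()` and `df.isnull().sum()`. A sketch on a toy frame with a few deliberate gaps (the output below comes from the full dataset, not this toy):

```python
import numpy as np
import pandas as pd

# Toy frame with deliberate missing values, standing in for the full dataset
df = pd.DataFrame({
    "Glucose": [148.0, 85.0, np.nan, 89.0],
    "Insulin": [np.nan, np.nan, np.nan, 94.0],
    "Outcome": [1, 0, 1, 0],
})

df.info()                    # dtypes and non-null counts per column
missing = df.isnull().sum()  # number of NaNs per column
print(missing)
```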
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 768 entries, 0 to 767
Data columns (total 9 columns):
 #   Column                    Non-Null Count  Dtype  
---  ------                    --------------  -----  
 0   Pregnancies               768 non-null    int64  
 1   Glucose                   763 non-null    float64
 2   BloodPressure             768 non-null    int64  
 3   SkinThickness             768 non-null    int64  
 4   Insulin                   394 non-null    float64
 5   BMI                       757 non-null    float64
 6   DiabetesPedigreeFunction  768 non-null    float64
 7   Age                       768 non-null    float64
 8   Outcome                   768 non-null    int64  
dtypes: float64(5), int64(4)
memory usage: 54.1 KB

Viewing more details about the data

We can use the describe function, which gives us summary statistics for the numeric columns

Pregnancies Glucose BloodPressure SkinThickness Insulin BMI DiabetesPedigreeFunction Age Outcome
count 768.000000 763.000000 768.000000 768.000000 394.000000 757.000000 768.000000 768.000000 768.000000
mean 3.845052 121.686763 69.105469 20.536458 155.548223 32.457464 0.471876 33.240885 0.348958
std 3.369578 30.535641 19.355807 15.952218 118.775855 6.924988 0.331329 11.760232 0.476951
min 0.000000 44.000000 0.000000 0.000000 14.000000 18.200000 0.078000 21.000000 0.000000
25% 1.000000 99.000000 62.000000 0.000000 76.250000 27.500000 0.243750 24.000000 0.000000
50% 3.000000 117.000000 72.000000 23.000000 125.000000 32.300000 0.372500 29.000000 0.000000
75% 6.000000 141.000000 80.000000 32.000000 190.000000 36.600000 0.626250 41.000000 1.000000
max 17.000000 199.000000 122.000000 99.000000 846.000000 67.100000 2.420000 81.000000 1.000000

Dealing with Missing and Zero Values

YOU decide what to do!
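One strategy consistent with the counts and shape shown below (a sketch, not the only option): drop the Insulin column entirely, since it is missing in roughly half the rows, and drop the few rows where Glucose or BMI are missing:

```python
import numpy as np
import pandas as pd

# Toy frame standing in for the full dataset
df = pd.DataFrame({
    "Glucose": [148.0, np.nan, 183.0, 89.0],
    "Insulin": [np.nan, np.nan, np.nan, 94.0],
    "BMI": [33.6, 26.6, np.nan, 28.1],
    "Outcome": [1, 0, 1, 0],
})

print(df.isnull().sum())                   # how many NaNs per column
df = df.drop(columns=["Insulin"])          # too many missing values to impute
df = df.dropna(subset=["Glucose", "BMI"])  # only a handful of rows affected
print(df.shape)
```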

Pregnancies                   0
Glucose                       5
BloodPressure                 0
SkinThickness                 0
Insulin                     374
BMI                          11
DiabetesPedigreeFunction      0
Age                           0
Outcome                       0
dtype: int64
(752, 8)

The Mean and Median

  • The mean is the simple mathematical average of a list of two or more numbers.
  • The median is the middle number in a sorted, ascending or descending, list of numbers and can be more descriptive of that data set than the average.
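A quick illustration of why the median can be more descriptive than the mean when outliers are present:

```python
import numpy as np

values = [1, 2, 3, 4, 100]  # one extreme outlier
print(np.mean(values))      # 22.0 -- dragged up by the outlier
print(np.median(values))    # 3.0  -- unaffected by the outlier
```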

https://www.investopedia.com/terms-beginning-with-m-4769363

Zero Values that Don't Make Sense
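A zero Glucose, BloodPressure, SkinThickness or BMI is not physiologically plausible, so one option (a sketch of the filtering that produces the shape below) is to drop those rows:

```python
import pandas as pd

# Toy frame with implausible zeros in two rows
df = pd.DataFrame({
    "Glucose": [148.0, 183.0, 89.0],
    "BloodPressure": [72, 64, 0],
    "SkinThickness": [35, 0, 23],
    "BMI": [33.6, 23.3, 28.1],
})

# Treat zeros in these columns as invalid measurements and drop the rows
for col in ["Glucose", "BloodPressure", "SkinThickness", "BMI"]:
    df = df[df[col] != 0]
print(df.shape)
```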

(532, 8)

Some Summary Information


Pregnancies Glucose BloodPressure SkinThickness BMI DiabetesPedigreeFunction Age Outcome
count 532.000000 532.000000 532.000000 532.000000 532.000000 532.000000 532.000000 532.000000
mean 3.516917 121.030075 71.505639 29.182331 32.890226 0.502966 31.614662 0.332707
std 3.312036 30.999226 12.310253 10.523878 6.881109 0.344546 10.761584 0.471626
min 0.000000 56.000000 24.000000 7.000000 18.200000 0.085000 21.000000 0.000000
25% 1.000000 98.750000 64.000000 22.000000 27.875000 0.258750 23.000000 0.000000
50% 2.000000 115.000000 72.000000 29.000000 32.800000 0.416000 28.000000 0.000000
75% 5.000000 141.250000 80.000000 36.000000 36.900000 0.658500 38.000000 1.000000
max 17.000000 199.000000 110.000000 99.000000 67.100000 2.420000 81.000000 1.000000
Pregnancies Glucose BloodPressure SkinThickness BMI DiabetesPedigreeFunction Age
Outcome
0 2.926761 110.016901 69.912676 27.290141 31.429577 0.446315 29.222535
1 4.700565 143.118644 74.700565 32.977401 35.819774 0.616588 36.412429
Pregnancies Glucose BloodPressure SkinThickness BMI DiabetesPedigreeFunction Age
mean median mean median mean median mean median mean median mean median mean median
Outcome
0 2.926761 2 110.016901 106.0 69.912676 70 27.290141 27 31.429577 30.9 0.446315 0.368 29.222535 25.0
1 4.700565 4 143.118644 144.0 74.700565 74 32.977401 32 35.819774 34.6 0.616588 0.542 36.412429 35.0
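The grouped tables above can be produced with groupby; a sketch on a toy frame:

```python
import pandas as pd

df = pd.DataFrame({
    "Glucose": [148.0, 85.0, 183.0, 89.0],
    "Age": [50, 31, 32, 21],
    "Outcome": [1, 0, 1, 0],
})

by_outcome_mean = df.groupby("Outcome").mean()           # one row per class
by_outcome_both = df.groupby("Outcome").agg(["mean", "median"])
print(by_outcome_both)
```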

2- Useful and Informative Plots

Histogram Plots
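Pandas can draw one histogram per numeric column in a single call; a sketch on random stand-in data:

```python
import matplotlib
matplotlib.use("Agg")
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
df = pd.DataFrame({"Glucose": rng.normal(121, 31, 500),
                   "BMI": rng.normal(33, 7, 500)})

# One histogram per numeric column
axes = df.hist(bins=20, figsize=(8, 4))
plt.tight_layout()
```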

Scatter Matrix

  • This one is a useful one-liner ... but note that it only works with numeric data
  • If you want to include categorical data in there you should convert the categories into numeric labels
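The one-liner in question, sketched on random stand-in data:

```python
import matplotlib
matplotlib.use("Agg")
import numpy as np
import pandas as pd

rng = np.random.default_rng(1)
df = pd.DataFrame({"Glucose": rng.normal(121, 31, 300),
                   "BMI": rng.normal(33, 7, 300),
                   "Age": rng.integers(21, 81, 300)})

# Histograms on the diagonal, pairwise scatter plots off the diagonal
axes = pd.plotting.scatter_matrix(df, figsize=(8, 8), diagonal="hist")
```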

The scatter matrix gives us both the histograms for the distributions along the diagonal and a lot of 2D scatter plots off the diagonal. Note that this is a symmetric matrix, so we normally just look at the diagonal and either the triangle below or above it. We can see that some variables have a lot of scatter and some are correlated (i.e. there is a direction in their scatter). Which leads us to...

Correlation Plots

To easily quantify which variables / attributes are correlated with others!

https://www.statisticshowto.com/probability-and-statistics/correlation-analysis/

  • Correlation is usually indicative of good information content
  • So in analysis you might want to use variables that are correlated with your outcome variable!
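The matrix below comes from `df.corr()`; a heatmap makes it easier to read at a glance. A sketch on stand-in data with a built-in dependence:

```python
import matplotlib
matplotlib.use("Agg")
import numpy as np
import pandas as pd
import seaborn as sns

rng = np.random.default_rng(2)
glucose = rng.normal(121, 31, 300)
df = pd.DataFrame({"Glucose": glucose,
                   "Outcome": (glucose > 140).astype(int),  # crude dependence
                   "BMI": rng.normal(33, 7, 300)})

corr = df.corr()  # Pearson correlation by default
sns.heatmap(corr, annot=True, cmap="coolwarm")
```
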
Pregnancies Glucose BloodPressure SkinThickness BMI DiabetesPedigreeFunction Age Outcome
Pregnancies 1.000000 0.125330 0.204663 0.095085 0.008576 0.007435 0.640747 0.252586
Glucose 0.125330 1.000000 0.219178 0.226590 0.247079 0.165817 0.278907 0.503614
BloodPressure 0.204663 0.219178 1.000000 0.226072 0.307357 0.008047 0.346939 0.183432
SkinThickness 0.095085 0.226590 0.226072 1.000000 0.647422 0.118636 0.161336 0.254874
BMI 0.008576 0.247079 0.307357 0.647422 1.000000 0.151107 0.073438 0.300901
DiabetesPedigreeFunction 0.007435 0.165817 0.008047 0.118636 0.151107 1.000000 0.071654 0.233074
Age 0.640747 0.278907 0.346939 0.161336 0.073438 0.071654 1.000000 0.315097
Outcome 0.252586 0.503614 0.183432 0.254874 0.300901 0.233074 0.315097 1.000000

And you can see this is a symmetric matrix too. But it immediately allows us to point out the most correlated and anti-correlated attributes. Some might just be common sense - Pregnancies vs Age for example - but some might give us real insight into the data.

  • It allows us to see what we need to investigate and which variables we can focus on if we do not have the time and resources to investigate every possible column

Covariance Plot

Covariance is a measure of how much two random variables vary together.

In other words, it is the measure of the joint spread between two random variables.

https://www.statisticshowto.com/covariance/
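The matrix below comes from `df.cov()`; a sketch on stand-in data:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(3)
df = pd.DataFrame({"Glucose": rng.normal(121, 31, 500),
                   "BMI": rng.normal(33, 7, 500)})

cov = df.cov()  # the diagonal holds each column's variance
print(cov)
```

Unlike correlation, covariance is not normalised, so its entries depend on the units of each variable.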

Pregnancies Glucose BloodPressure SkinThickness BMI DiabetesPedigreeFunction Age Outcome
Pregnancies 10.969581 12.867664 8.344537 3.314235 0.195458 0.008485 22.837981 0.394549
Glucose 12.867664 960.952013 83.640131 73.921060 52.704249 1.771041 93.043626 7.362856
BloodPressure 8.344537 83.640131 151.542341 29.288047 26.035648 0.034132 45.961687 1.064975
SkinThickness 3.314235 73.921060 29.288047 110.752004 46.883706 0.430168 18.271898 1.265023
BMI 0.195458 52.704249 26.035648 46.883706 47.349659 0.358254 5.438223 0.976516
DiabetesPedigreeFunction 0.008485 1.771041 0.034132 0.430168 0.358254 0.118712 0.265684 0.037874
Age 22.837981 93.043626 45.961687 18.271898 5.438223 0.265684 115.811687 1.599256
Outcome 0.394549 7.362856 1.064975 1.265023 0.976516 0.037874 1.599256 0.222431

Box Plots, Violin Plots and Bee Swarm Plots
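All three are seaborn one-liners; a sketch comparing Glucose across the two Outcome classes on stand-in data:

```python
import matplotlib
matplotlib.use("Agg")
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import seaborn as sns

rng = np.random.default_rng(4)
df = pd.DataFrame({"Outcome": rng.integers(0, 2, 200),
                   "Glucose": rng.normal(121, 31, 200)})

fig, axes = plt.subplots(1, 3, figsize=(12, 4))
sns.boxplot(data=df, x="Outcome", y="Glucose", ax=axes[0])      # quartiles and outliers
sns.violinplot(data=df, x="Outcome", y="Glucose", ax=axes[1])   # smoothed distribution shape
sns.swarmplot(data=df, x="Outcome", y="Glucose", size=2, ax=axes[2])  # every point shown
```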

2D Histograms

Useful when you have a lot of data ... i.e. at least thousands of points

See here for the API
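With matplotlib this is `plt.hist2d`; a sketch on stand-in data with a loose correlation built in:

```python
import matplotlib
matplotlib.use("Agg")
import matplotlib.pyplot as plt
import numpy as np

rng = np.random.default_rng(5)
glucose = rng.normal(121, 31, 5000)
bmi = 0.1 * glucose + rng.normal(20, 6, 5000)  # loosely correlated with glucose

# Bin both variables jointly into a 30x30 grid of counts
counts, xedges, yedges, image = plt.hist2d(glucose, bmi, bins=30)
plt.xlabel("Glucose")
plt.ylabel("BMI")
```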

Contour plots

A bit hard to get information from the 2D histogram, isn't it? Too much noise in the image. What if we try a contour diagram? We'll have to bin the data ourselves. The contour API is here

  • Contour plots help us make the correlation in the data a bit more obvious
  • But remember .. they require a lot of data
  • Looks like it's just as noisy with the contour plot
  • In general, 2D histograms and contour plots need a lot of data
  • We simply don't have enough data to get smooth results!
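Binning the data ourselves and contouring the counts might look like this (a sketch on stand-in data):

```python
import matplotlib
matplotlib.use("Agg")
import matplotlib.pyplot as plt
import numpy as np

rng = np.random.default_rng(6)
x = rng.normal(121, 31, 5000)
y = 0.1 * x + rng.normal(20, 6, 5000)

# Bin the data ourselves, then contour the counts at the bin centres
counts, xedges, yedges = np.histogram2d(x, y, bins=25)
xc = 0.5 * (xedges[:-1] + xedges[1:])
yc = 0.5 * (yedges[:-1] + yedges[1:])
plt.contour(xc, yc, counts.T)  # transpose: histogram2d puts x along axis 0
```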

KDE Plots

With these plots we smooth the data ourselves. Seaborn to the rescue!

  • KDE smooths the data for you and gives you an approximation
  • The smooth surface is not the real data, it is something we have done to it using a kernel density
  • A kernel specifies how we are smoothing the data
  • By default here we are using a Gaussian kernel
  • It takes the 2D histogram and convolves it with a kernel (in the shape of a Gaussian) to smear it

No Harm in keeping it Simple

A scatter plot is normally fairly informative and very fast to plot.

  • Looking at the scatter plot above, can we make some useful informed guesses?
  • Given the level of Glucose and BMI of a random person, can we guess if they are diabetic?
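Such a scatter plot, with points coloured by diagnosis, can be sketched like this (stand-in data with the diabetic group shifted higher, mimicking the real trend):

```python
import matplotlib
matplotlib.use("Agg")
import matplotlib.pyplot as plt
import numpy as np

rng = np.random.default_rng(8)
outcome = rng.integers(0, 2, 300)
glucose = rng.normal(110 + 30 * outcome, 25)  # diabetic cloud shifted higher
bmi = rng.normal(31 + 4 * outcome, 6)

# Colour each point by diagnosis: the two clouds separate visibly
plt.scatter(glucose, bmi, c=outcome, cmap="coolwarm", s=10)
plt.xlabel("Glucose")
plt.ylabel("BMI")
```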

Treating Points with Probability

Using the ChainConsumer library (examples here), written by Samuel Hinton.

It is easy to install:

pip install chainconsumer

(355, 2)
  • It gives you two contours for each chain, one for the 68% confidence interval and the other for the 95% confidence interval
  • This is useful when doing hypothesis testing
  • For example, if you randomly pick a Diabetic person out of a different data sample, you can say that 68% of the time we would expect their BMI and Glucose levels to lie within the 68% contour, and 95% of the time within the 95% contour
  • This is useful when you would like to check whether a data point comes from this distribution or not
  • For example, you can check where the data point lies and estimate the chance of it belonging to a Diabetic or Non-Diabetic person
  • If we ignore the correlations, we can visually see that Diabetic individuals have higher Glucose levels than Non-Diabetic individuals
  • They also tend to have a higher BMI (look at the tail of the curve)

3- A Probabilistic Analysis

Based on the previous Correlation plot, a simple approach might be just to use the top correlated variables and investigate them further. In our case, they're: Glucose, BMI and Age.

  • We will not use "Pregnancies" because we will not pay attention to Gender
  • And we will not use "DiabetesPedigreeFunction" because we do not expect patients to know its value when they come to a clinic!
(532, 4)

So it's not perfect, but we can probably do an alright job approximating both of these distributions as Gaussians.

Multivariate Normal Distribution

Allows us to model Glucose, BMI and Age as a normal in 3 dimensions rather than 3 independent normals. That way we can have correlations between them.
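A sketch of the idea with scipy (the data and the test point here are illustrative stand-ins, not the real dataset): fit a multivariate normal per class, evaluate the test point under each, and weight by class prevalence, which is how the numbers below are produced:

```python
import numpy as np
from scipy.stats import multivariate_normal

rng = np.random.default_rng(9)
# Stand-in samples of [Glucose, BMI, Age] for each class
X_neg = rng.multivariate_normal([110, 31, 29], np.diag([700, 40, 100]), 355)
X_pos = rng.multivariate_normal([143, 36, 36], np.diag([900, 45, 120]), 177)

def fit_gaussian(X):
    # Multivariate normal with the sample mean and full covariance,
    # so correlations between the three variables are captured
    return multivariate_normal(mean=X.mean(axis=0), cov=np.cov(X, rowvar=False))

g_neg, g_pos = fit_gaussian(X_neg), fit_gaussian(X_pos)

test_point = [130.0, 34.0, 40.0]
# Weight each density by how many people are in that class
w_neg = g_neg.pdf(test_point) * len(X_neg)
w_pos = g_pos.pdf(test_point) * len(X_pos)
p_negative = w_neg / (w_neg + w_pos)
print(f"Negative diagnosis chance is {100 * p_negative:.2f}%")
```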

Plot the data in 3D and Add a Test Point

[4.815112456294995e-06, 2.4372535200453195e-06]
Number of people with Diabetes is:  177
Number of people without Diabetes is:  355
Negative diagnosis chance is 50.38%

Very Important to Notice:

  • You should notice how the probability is weighted.
  • The weighting accounts for the class imbalance (remember there are far more patients without diabetes than with)
  • We could only compare the two distributions directly if they carried equal total probability (i.e. the same number of people with and without)
  • That is rarely the case, so we have to weight them
  • This might seem odd: even a test patient sitting right on the maximum of our model for the diabetes patients can still come out as more likely non-diabetic

4- Predicting Diabetes with Logistic Regression

  • Logistic Regression measures the relationship between the dependent variable (our Outcome or what we want to predict) and one or more independent variables (our features or input variables), by estimating probabilities using its underlying logistic function
  • These probabilities must then be transformed into binary values in order to actually make a prediction
  • This is the task of the logistic function, also called the sigmoid function
  • The Sigmoid-Function is an S-shaped curve that can take any real-valued number and map it into a value between 0 and 1, but never exactly at those limits
  • The values between 0 and 1 will then be transformed into either 0 or 1 using a threshold classifier

More here: https://machinelearning-blog.com/2018/04/23/logistic-regression-101/
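A sketch of the fit and evaluation with sklearn (the data here is a synthetic stand-in with Outcome loosely driven by Glucose, so the accuracy printed will not match the value below):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(10)
n = 400
X = np.column_stack([rng.normal(121, 31, n),    # Glucose
                     rng.normal(33, 7, n),      # BMI
                     rng.integers(21, 81, n)])  # Age
# Synthetic outcome loosely driven by glucose
y = (X[:, 0] + rng.normal(0, 20, n) > 130).astype(int)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0)
model = LogisticRegression(max_iter=1000).fit(X_train, y_train)
accuracy = model.score(X_test, y_test)
print(f"accuracy = {100 * accuracy:.1f} %")

# Class probabilities for a new patient [Glucose, BMI, Age]
proba = model.predict_proba([[130.0, 34.0, 40.0]])
```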

Glucose BMI Age Outcome
0 148.0 33.6 50.0 1
1 85.0 26.6 31.0 0
3 89.0 28.1 21.0 0
4 137.0 43.1 33.0 1
6 78.0 31.0 26.0 1
LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
                   intercept_scaling=1, l1_ratio=None, max_iter=100,
                   multi_class='auto', n_jobs=None, penalty='l2',
                   random_state=None, solver='lbfgs', tol=0.0001, verbose=0,
                   warm_start=False)
accuracy =  83.33333333333334 %

Interpreting the Model

To get a better sense of what is going on inside the logistic regression model, we can visualize how our model uses the different features and which features have the greatest effect on the outcome
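A table like the one below can be built from the fitted model's coefficients; a self-contained sketch on synthetic data (feature names are the three we selected earlier):

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(11)
X = np.column_stack([rng.normal(121, 31, 300),   # Glucose
                     rng.normal(33, 7, 300),     # BMI
                     rng.integers(21, 81, 300)]) # Age
y = (X[:, 0] > 130).astype(int)  # outcome driven by glucose here
model = LogisticRegression(max_iter=1000).fit(X, y)

# One coefficient per feature; magnitude ~ influence, sign ~ direction
features = ["Glucose", "BMI", "Age"]
coef_table = pd.DataFrame({"importance": np.abs(model.coef_[0]),
                           "positive": model.coef_[0] > 0},
                          index=pd.Index(features, name="Features"))
print(coef_table.sort_values("importance"))
```

One caveat worth keeping in mind: raw coefficients are only directly comparable across features when the features share a scale; standardising the inputs first (e.g. with sklearn's StandardScaler) makes this comparison fairer.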

importance positive
Features
Glucose 0.034573 True
Age 0.047817 True
BMI 0.087623 True

Conclusions from the above figure:

  • BMI level has the most significant influence on the model (even more than Glucose) .. does this make sense?
  • The second highest influencer is Age
  • In third place comes the Glucose level
  • All three have a positive influence on the prediction, i.e. their higher values are correlated with a person being diabetic
  • Although correlation tells us that Glucose is more correlated with the Outcome than BMI is, the model relies more on BMI. This can happen for several reasons, including that the information carried by Glucose is partly captured by other variables, whereas the information carried by BMI is not.

Another look at Correlation

Predict the same test point from above

array([0.])
array([[0.54339335, 0.45660665]])

Notice:

  • Our logistic regression model agrees with our statistical model (the multivariate normal)
  • Good stuff!

Well Done!